Abstract

This  paper  proposes  architectural  refinements,  server-driven  metadata  prefetching  and  namespace  flattening,  for  improving  the 
efficiency  of  small  file  workloads  in  object-based  storage  systems.  Server-driven  metadata  prefetching  consists  of  having  the 
metadata  server  provide  information  and  capabilities  for  multiple  objects,  rather  than  just  one,  in  response  to  each  lookup.  Doing  so 
allows  clients  to  access  the  contents  of  many  small  files  for  each  metadata  server  interaction,  reducing  access  latency  and  metadata 
server  load.  Namespace  flattening  encodes  the  directory  hierarchy  into  object  IDs  such  that  namespace  locality  translates  to 
object  ID  similarity.  Doing  so  exposes  namespace  relationships  among  objects  (e.g.,  as  hints  to  storage  devices),  improves  locality 
in  metadata  indices,  and  enables  use  of  ranges  for  exploiting  them.  Trace-driven  simulations  and  experiments  with  a  prototype 
implementation show significant performance benefits for small file workloads.

Figure 1: Direct client access. First, a client interacts with the metadata server to obtain mapping information (e.g., which
object  IDs  to  access  and  on  which  storage  devices)  and  capabilities  (i.e.,  evidence  of  access  rights).  Second,  the  client  interacts 
with  the  appropriate  storage  device(s)  to  read  and  write  data,  providing  a  capability  with  each  request. 


1  Introduction 

Scalable  storage  solutions  increasingly  rely  on  direct  client  access  to  achieve  high  bandwidth.  In  some  cases, 
direct  access  is  achieved  by  specialized  SAN  protocols  [2,  28,  16]  and,  in  others,  by  object-based  storage 
protocols  [12,  18,  23].  As  illustrated  in  Figure  1,  direct  client  access  offers  scalable  bandwidth  by  removing 
the  centralized  server  bottleneck,  shifting  metadata  management  out  of  the  critical  path.  As  a  result,  this 
storage  architecture  is  becoming  standard  in  high-end  scientific  computing. 

But, it is destined to be a niche architecture unless it can effectively support a broader range of workloads,
despite the potential additional value of object-based storage for document management [8, 26] and
automation [11, 26]. Although excellent for high-bandwidth access to large files, direct access systems
struggle with workloads involving access to many small files. In particular, direct client access to each file’s
data requires first accessing the metadata server (for mapping information and capabilities) and then accessing
the storage device. With large files, the one-time-per-file metadata access can be amortized over many
data  accesses.  With  small  files,  however,  it  can  double  the  latency  for  data  access  and  become  a  system 
bottleneck. 

This  paper  proposes  architectural  refinements  to  increase  small  file  efficiency  in  such  systems.  At  a 
high  level,  the  approach  combines  restoring  lost  file  inter-relationships  with  metadata  prefetching.  When  a 
client  requests  metadata  for  one  file,  the  metadata  server  provides  mapping  information  and  capabilities  for  it 
and  other  related  files.  By  caching  them,  clients  can  potentially  eliminate  most  metadata  server  interactions 
for  small  file  data  access,  reducing  the  load  on  the  metadata  server  and  access  latencies  for  clients. 

As  with  any  prefetching,  a  key  challenge  is  identifying  strong  relationships  so  that  the  right  objects  are 
prefetched.  Today’s  file  systems  use  the  namespace  as  a  hint  regarding  inter-file  relationships,  organizing 
and  prefetching  metadata  and  data  accordingly.  But,  in  object-based  storage,  the  level  of  indirection  between 
file  naming  and  object  naming  obscures  this  hint.  We  restore  it  via  namespace  flattening,  or  encoding  a  file’s 
hierarchical  directory  position  into  the  object  ID  assigned  to  it.  Such  preservation  of  namespace  locality  in 
object  IDs  naturally  retains  this  traditional  hint  for  storage  devices,  provides  spatial  locality  in  the  metadata 
structures,  and  enables  compact  representations  for  groups  (i.e.,  ranges)  of  related  objects. 

This  paper  describes  the  architecture  and  protocol  changes  required  to  support  server-driven  metadata 
prefetching.  As  well,  efficiency  requires  capabilities  that  authorize  access  to  multiple  objects  and,  of  course, 
appropriate  client  cache  management.  No  interface  changes  are  required  for  namespace  flattening. 

Measurements of a prototype implementation show significant benefits for workloads dominated by
small file access. Client applications, such as CVS and system compilation, can achieve significantly higher
throughput. Metadata server loads decrease by 27-94%, which would allow the system to scale 1.4-17x
larger before facing metadata server scalability problems. In addition to benchmark experiments with the
prototype, analysis of real NFS traces confirms the value of namespace-based prefetching and the potential
metadata server load reductions. In these traces, 5-34% of all metadata server interactions can be eliminated
via the prefetching mechanisms.
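
The relationship between the load reduction and the scaling headroom quoted above can be checked with a short calculation. The sketch below assumes (our simplification, not a claim from the measurements) that metadata server load grows linearly with the number of clients, so removing a fraction r of the requests allows roughly 1/(1 - r) times as many clients. The function name scale_factor is illustrative.

```python
def scale_factor(load_reduction: float) -> float:
    """Growth in client count that restores the original metadata server load.

    Assumes load is proportional to the number of clients (a simplification).
    """
    if not 0.0 <= load_reduction < 1.0:
        raise ValueError("load_reduction must be in [0, 1)")
    return 1.0 / (1.0 - load_reduction)


# The two ends of the reported 27-94% range.
for reduction in (0.27, 0.94):
    print(f"{reduction:.0%} fewer requests -> about {scale_factor(reduction):.1f}x")
```

Under this assumption, a 27% reduction gives about 1.4x and a 94% reduction gives about 17x, matching the range stated above.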

2  Small  files  and  object  stores 

This  section  reviews  object-based  storage,  its  struggles  with  small  files,  how  we  propose  to  address  the 
struggles,  and  related  work. 

2.1 Object-based storage

A  storage  object  is  a  byte-addressed  sequence  of  bytes,  plus  a  set  of  attributes,  accessed  via  a  file-like 
interface  (e.g.,  CREATE,  DELETE,  READ,  WRITE,  and  so  on).  An  object  store  is  much  like  a  filesystem,  but 
without  the  ASCII  names.  Objects  are  named  by  object  IDs  drawn  from  a  flat  numerical  namespace  (e.g., 
64-bit  numbers). 

Object-based  storage  was  originally  conceived  [12]  as  an  architecture  (illustrated  in  Figure  1)  for 
achieving  cost-effective  scalable  bandwidth  to  storage.  The  metadata  server  (called  a  “file  manager”  by 
Gibson  et  al.)  would  store  file  system  metadata  and  handle  metadata  actions,  such  as  creation,  deletion,  and 
lookup.  To  access  data,  a  client  would  fetch  mapping  information  and  capabilities  from  the  metadata  server 
and,  then,  read/write  data  directly  from/to  the  object  storage  devices.  By  doing  so,  clients  could  potentially 
exploit  the  full  switching  bandwidth  of  the  network  interconnect  for  their  data  accesses.  This  is  in  contrast 
to  the  conventional  server  model,  which  is  limited  by  the  bandwidth  available  from  a  file  server  interposed 
between  clients  and  many  disks. 

Object  storage  is  gaining  popularity  and  traction.  A  working  group  of  the  Storage  Networking  Industry 
Association produced a draft specification, and the ANSI T10 body has reviewed and ratified it as an interface
standard [23]. Research on object-based storage continues [9, 20, 29] and early products [18, 24, 26]
have  appeared.  In  addition  to  scalable  bandwidth,  some  are  beginning  to  exploit  object-based  storage  as  a 
mechanism  to  bundle  data  and  application-defined  attributes  for  long-term  maintenance  (e.g.,  for  regulatory 
compliance)  [8,  26].  But,  for  object-based  storage  to  be  viable  outside  of  niche  domains,  it  must  be  able  to 
support  small  file  workloads  effectively. 

2.2  Problems  for  small  files 

Although  not  required  by  the  architecture,  almost  all  object-based  storage  systems  map  each  file  to  an 
object  (or  multiple,  with  data  striped  among  them).  These  systems  struggle  with  workloads  that  access 
large  numbers  of  small  files,  such  as  software  development  and  user  workspaces,  for  two  reasons:  per-file 
metadata  server  interactions  and  loss  of  namespace  locality  at  the  storage  devices. 

Per-file  metadata  server  interactions:  To  access  a  file’s  data,  a  client  must  first  have  the  corresponding 
mapping  information  and  capabilities.  To  get  them,  the  client  must  interact  with  the  metadata  server.  Only 
then  can  the  client  communicate  directly  with  the  storage  devices  to  access  the  data. 

This  metadata  server  interaction  happens  once  for  each  access.  For  a  client  accessing  the  contents  of  a 
large  file,  this  interaction  is  usually  a  minor  overhead  amortized  over  many  data  accesses.  For  a  small  file, 
on  the  other  hand,  there  can  be  as  few  as  one  data  access.  Two  performance  problems  can  result:  increased 
latency  for  client  access  and  heavy  load  on  the  metadata  server.  Since  accessing  each  file’s  data  requires  first 
interacting  with  the  metadata  server,  client  latency  can  be  doubled  (two  RPC  roundtrips)  and  the  metadata 
server can be asked to service as many requests as all the storage devices combined.

Loss of storage locality: In object-based storage, the storage devices allocate on-disk locations for
the  objects  they  store.  The  performance  consequences  and  goals  for  this  are  essentially  the  same  as  those 
for  local  file  systems.  For  large  objects,  good  performance  will  usually  be  achieved  by  ensuring  that  each 
object’s  contents  are  placed  sequentially.  For  small  objects,  good  performance  requires  inter-object  locality, 
keeping  objects  that  are  likely  to  be  accessed  together  close  together  on  disk. 

Most file systems achieve small file locality by exploiting the relationship hints exposed by the directory
structures. Object-based storage systems make this difficult by hiding this information from the storage
devices: only the metadata server knows about the directory structures. The storage devices see object IDs
instead and, thus, can’t effectively employ the locality-enhancing tricks common in file systems.

2.3  Solving  the  problems 

Server-driven metadata prefetching and namespace flattening can address the above problems with minimal
changes to the object-based storage architecture.

Server-driven metadata prefetching: Rather than returning metadata for just one object, when queried,
the metadata server should return metadata for other related objects as well. By doing so, it allows the client
to populate its cache with additional mapping information and capabilities, a form of prefetching, but
orchestrated by the metadata server. The client still determines what to keep and replace, but the server
determines what to prefetch. When the necessary metadata is in its cache, a client can access the storage device
immediately. Thus, if the right additional metadata is returned, the number of metadata server interactions
should drop dramatically.
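
The effect of client-side hits on prefetched metadata can be sketched with a simple model. It assumes (our simplification) that every small-file access costs one storage device round trip, plus one metadata server round trip whenever the needed metadata is not already cached. The names expected_round_trips and metadata_requests are illustrative.

```python
def expected_round_trips(hit_rate: float) -> float:
    """Average RPC round trips per small-file access.

    One round trip to the storage device is always needed; a metadata
    cache miss adds one round trip to the metadata server.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return 1.0 + (1.0 - hit_rate)


def metadata_requests(accesses: int, hit_rate: float) -> float:
    """Metadata server requests generated by a number of file accesses."""
    return accesses * (1.0 - hit_rate)


# Without prefetching (no hits) each access takes two round trips;
# with a perfect hit rate it takes one and the server sees no requests.
print(expected_round_trips(0.0), expected_round_trips(1.0))
```

In this model the doubled latency described in Section 2.2 corresponds to a hit rate of zero, and both latency and metadata server load fall linearly as the hit rate rises.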

The common model of prefetching has the client specify what to get. Server-driven prefetching is more
natural in this context for several fundamental and practical reasons. First, and perhaps foremost, knowing
what to prefetch requires knowing what exists. The metadata server has this information already and clients
do not, unless they track the existence and inter-relationships of files redundantly (e.g., on their local disk).
This is in contrast to large file streaming, which can be done by simply asking for the next sequential range
of data. Second, server vendors differentiate on performance, among other things. Giving the server control
over metadata prefetching increases the likelihood that it will be utilized and tuned aggressively. Third,
the metadata server knows what can be prefetched with minimal cost. For example, it can choose what to
prefetch based on what metadata blocks are in its cache and where metadata block boundaries are located.

We promote server-driven metadata prefetching in this paper, because it requires less mechanism
and less overhead than traditional client-driven prefetching. But, the key insight is that batched metadata
prefetching is needed to address the “per-file metadata server interactions” problem. Client-driven prefetching
could likely be engineered to work just as well, with enough effort (e.g., duplication of metadata at
clients).

Namespace flattening: Namespace flattening translates the traditional file system approach for improving
small file performance to object-based storage. Rather than assigning object IDs via some namespace-independent
allocation policy (e.g., a monotonically increasing number), object IDs are chosen to reflect
locality in the file namespace. This is analogous to inode number selection policies in file systems, which
almost always utilize directory structure information to enhance locality. In fact, object-based storage has
been explained [20] as splitting the file system at the inode layer, with the “upper half” being at the metadata
server and the “lower half” being at the object storage device.

Encoding namespace relationships into object IDs provides several benefits. First, storage devices can
treat the object ID as a locality hint, with closeness indicating relationships that could be exploited for
internal layout and cache management policies. This is analogous to how most file systems map namespace
locality to block number locality in their on-disk layouts. Second, index structures for object metadata at
storage devices, which are typically organized as tables or B-trees indexed by object ID, will also naturally
have better locality. Third, a set of related files can be identified by a compact object ID range rather than
an enumerated list.

The use of data identifiers as hints about spatial locality has a long history in storage systems; in fact,
it is the foundation on which disk system performance is tuned. File systems and databases assign related
data to numerically close block numbers, and storage devices provide higher performance for accesses to
numerically close block numbers. Making the object ID a hint about locality follows this same approach,
allowing basic (but non-mandatory) cooperation with no changes to the object storage interface or standardization
of hint attributes. As with block number locality in traditional storage, the only effect of any
client or storage device not conforming to the implicit convention is lower performance. Also, though we
promote particular namespace flattening schemes in this paper, it should be noted that the convention does
not specify how object IDs are assigned; it simply suggests that numerically close object ID numbers might
indicate locality.

2.4  Related  work 

This section discusses related work. Note that prefetching has a long history in storage systems, and we will
not cover all such work.

Namespace-based locality has long been recognized as a good indicator of inter-file relationships and
exploited in file system disk layouts. FFS [19] introduced the cylinder group, which many file systems now
call “allocation groups,” as a mechanism for placing related file system structures in a common disk region
with a few simple rules: the inode for a new file is allocated in the same cylinder group as the directory
that names it, and the first few data blocks of a file are allocated in the same cylinder group as the inode
that describes it. Until space within a cylinder group runs short, these rules effectively place all of the
metadata and data for all small files within any given directory in a small region of the disk. C-FFS [10]
and ReiserFS [21] go further by co-locating related data and metadata into sequential runs of blocks on disk
according to namespace locality, rather than just attempting to get them nearby.

Namespace-based locality is a reasonable assumption when no other information about future access
patterns is available. Many have explored approaches to using application hints [4, 25] and observed access
patterns [13, 17] for controlling prefetching, caching, and disk layout. These approaches to identifying inter-file
relationships can be more accurate than namespace-based locality, and they could be used with server-driven
metadata prefetching (instead of namespace flattening). But, namespace-based locality remains the
predominant approach in real systems in part because it avoids the additional mechanisms and (for hints)
API/application changes. We believe that enabling exploitation of namespace-based locality is the right
place to start, given the minimal changes required.

This paper proposes a new approach to addressing client latency and metadata server load for object-based
storage systems handling small-file workloads. At least three approaches have been taken to addressing
such issues in other systems. First, metadata can be partitioned among multiple servers [3, 30]; doing so
scales throughput, but does not address the extra roundtrip or locality issues and it requires multi-server consistency
for some operations (e.g., RENAME and snapshot). Second, requests can be batched to reduce their
quantity (e.g., NFSv3’s READDIRPLUS or NFSv4’s compound RPCs); server-driven metadata prefetching
can be viewed as a form of batching, though orchestrated by the server rather than by clients. Third, the
metadata server could store the data for small files, rather than using objects for them; this could eliminate
the extra round-trip associated with accessing small files, but it would exacerbate rather than reduce metadata
server load issues. These three approaches are complementary to the approach proposed in this paper,
rather than competitors, and large-scale systems will likely require a combination of several of them.

3 Improving small file efficiency


This  section  describes  the  mechanics  of  server-driven  metadata  prefetching  and  two  example  namespace 
flattening  algorithms. 

3.1  Server-driven  metadata  prefetching 

To  access  data  in  an  object-based  storage  system,  a  client  must  first  fetch  metadata  and  capabilities  from 
the  metadata  server;  only  then  can  it  interact  directly  with  the  object  storage  devices.  This  section  reviews 
object  metadata  and  describes  the  mechanics  of  server-driven  metadata  prefetching,  including  the  changes 
required  at  each  major  component  of  the  object-based  storage  architecture.  It  also  describes  multi-object 
capabilities  as  a  means  of  avoiding  increased  cryptographic  or  network  costs  for  prefetched  capabilities. 

3.1.1  Object  metadata  and  capabilities 

Metadata for each object includes mapping information and descriptive information. Capabilities are credentials,
created by the metadata server, that can be shown to a storage device to demonstrate a client’s right
to access particular objects.

Mapping information: Mapping information describes the location(s) of data corresponding to a particular
file and, if more than one location is involved, how data is spread among those locations. A location
is  composed  of  the  identity  of  a  storage  device  and  an  object  ID  on  that  device.  A  file’s  data  can  be  spread 
over  multiple  storage  devices  in  many  ways,  much  as  in  disk  arrays,  such  as  striping  with  parity  (RAID  5) 
or  replication. 

Descriptive information: Different systems store different descriptive information in the metadata
managed by the metadata server. Examples include object length, access control lists (ACLs), and access/modification
times. ACLs are almost always managed by the metadata server, since this information
allows it to determine which requests to service and which capabilities to give out. Conversely, authority
over length and time values may lie with the metadata server or with the storage devices. The former is
simpler but requires extra interactions between clients and the metadata server in order to update these values.
The latter allows the values to be updated as reads and writes occur to the storage devices, eliminating
metadata server interactions regarding length and times. But, when data for a file is spread across several
storage devices, obtaining authoritative values for length and times is simpler if the metadata server manages
them.

Capabilities:  A  capability  provides  cryptographic  proof  to  a  storage  device  that  a  client  has  permission 
to  perform  a  particular  operation.  The  metadata  server  constructs  capabilities  for  clients  after  deciding  that 
the  client  should  have  access.  The  capability  generally  consists  of  a  MAC  of  the  mapping  information  and 
access  rights  conveyed,  as  well  as  freshness  fields  to  avoid  replay  attacks.  The  key  used  for  generating  the 
MAC  is  a  shared  secret  between  the  metadata  server  and  the  storage  device;  this  allows  the  storage  device 
to  verify  that  the  metadata  server  created  the  capability  as  no  other  entity  is  privy  to  the  shared  key. 
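
The capability structure described above can be sketched as follows. This is a minimal illustration, not the format of any particular system or of the OSD specification: the field layout, the use of HMAC-SHA256, and the names make_capability and verify_capability are all assumptions made for the example.

```python
import hashlib
import hmac
import struct

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def make_capability(key: bytes, device_id: int, object_id: int,
                    rights: int, version: int) -> bytes:
    """Metadata server side: pack the fields and append a MAC over them.

    The field layout (device, object, rights bitmask, freshness version)
    is a hypothetical one chosen for this sketch.
    """
    fields = struct.pack(">QQIQ", device_id, object_id, rights, version)
    tag = hmac.new(key, fields, hashlib.sha256).digest()
    return fields + tag


def verify_capability(key: bytes, capability: bytes) -> bool:
    """Storage device side: recompute the MAC with the shared secret."""
    fields, tag = capability[:-TAG_LEN], capability[-TAG_LEN:]
    expected = hmac.new(key, fields, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


shared_key = b"secret shared by metadata server and device"
cap = make_capability(shared_key, device_id=3, object_id=0x1001,
                      rights=0b01, version=7)
print(verify_capability(shared_key, cap))
```

Because only the metadata server and the storage device hold the key, a client cannot alter the object ID or rights without invalidating the tag. A real device would also check the freshness field against its own state.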

3.1.2  Changes  in  each  component 

The object-based storage architecture has three primary components: the metadata server, clients, and storage
devices. This section describes how each changes in realizing server-driven metadata prefetching.

Metadata server: The metadata server services metadata queries and updates from clients. In traditional
object-based storage systems, each client request interacts with the metadata server with respect to one
object. For server-driven metadata prefetching, the metadata server is extended to respond to each lookup
request with the metadata and capabilities for the client-specified object as well as other objects it believes
that the client is likely to access. The collective response is thus an array, rather than a singleton.

It is desirable to minimize prefetching overheads as the additional responses will not always be valuable.
The increased network bandwidth can be reduced by compression techniques, if needed, but should
not be substantial in practice, since metadata is small even relative to small files. Disk and cryptography overheads,
on the other hand, are of greater concern. In terms of disk overhead, the issue is extracting the
prefetched metadata from their persistent data structures (e.g., per-object inodes in a table or B-tree). A synergy
with namespace flattening helps here, as the spatial locality it creates solves this problem: looking up
a sequence of values in a table or B-tree is very efficient. In terms of cryptography overhead, each capability
traditionally provides access to a single object, which means that the prefetching requires generation of
many capabilities. Section 3.1.3 describes a way of mitigating this cost by having each capability authorize
access to multiple objects.
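
A lookup that returns an array of entries can be sketched as below. The sketch uses a sorted in-memory list as a stand-in for the table or B-tree index, and it prefetches the next few object IDs in sorted order; that successor-only policy, the class name MetadataServer, and the prefetch_count parameter are assumptions for illustration, since the choice of what to prefetch is left to the server.

```python
import bisect


class MetadataServer:
    """Toy metadata server whose lookup returns a batch of entries."""

    def __init__(self, prefetch_count: int = 4):
        self._ids = []    # sorted object IDs, standing in for a B-tree index
        self._meta = {}   # object ID -> metadata record
        self.prefetch_count = prefetch_count

    def create(self, object_id: int, metadata: dict) -> None:
        if object_id not in self._meta:
            bisect.insort(self._ids, object_id)
        self._meta[object_id] = metadata

    def lookup(self, object_id: int) -> list:
        """Return the requested entry first, then its successors by ID."""
        if object_id not in self._meta:
            raise KeyError(object_id)
        start = bisect.bisect_left(self._ids, object_id)
        batch = self._ids[start:start + 1 + self.prefetch_count]
        return [(oid, self._meta[oid]) for oid in batch]


mds = MetadataServer(prefetch_count=2)
for oid in (50, 10, 12, 11):
    mds.create(oid, {"device": "osd0"})
print([oid for oid, _ in mds.lookup(10)])
```

With flattened object IDs, the entries in one batch are neighbors in the index, so the extra entries come from a single sequential scan rather than from separate lookups.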

Clients:  The  primary  change  to  clients  is  that  they  now  receive  additional  responses  from  the  metadata 
server.  So,  they  should  cache  these  responses  in  addition  to  the  one  requested.  The  client  cache  management 
should  be  straightforward.  For  example,  the  client  could  maintain  two  distinct  caches  (one  for  requested 
entries  and  one  for  server-pushed  entries).  The  client  should  pay  attention  to  the  hit  rates  of  each,  however, 
to  avoid  using  too  much  space  when  the  server-pushed  entries  are  not  useful.  The  pathological  case  would  be 
a  working  set  that  would  just  barely  fit  in  the  full-sized  client  cache  but  no  longer  fits  because  of  uselessly 
prefetched  values.  A  well-constructed  client  cache  should  notice  that  the  server-pushed  entries  are  not 
useful,  given  the  workload,  and  retain  fewer  of  them. 
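
One way to realize the two-cache arrangement is sketched below. It is an illustration under assumed policies, not the prototype's cache: both caches use LRU replacement, a hit on a server-pushed entry promotes it to the requested-entry cache, and the pushed cache is halved when few pushed entries are ever used. The class name ClientMetadataCache, the thresholds, and the slot counts are hypothetical.

```python
from collections import OrderedDict


class ClientMetadataCache:
    """Separate LRU caches for requested and server-pushed metadata."""

    def __init__(self, demand_slots: int = 64, pushed_slots: int = 64):
        self.demand = OrderedDict()
        self.pushed = OrderedDict()
        self.demand_slots = demand_slots
        self.pushed_slots = pushed_slots
        self.pushed_inserts = 0   # pushed entries stored so far
        self.pushed_hits = 0      # pushed entries that were later used

    @staticmethod
    def _put(cache: OrderedDict, limit: int, key, value) -> None:
        cache[key] = value
        cache.move_to_end(key)
        while len(cache) > limit:
            cache.popitem(last=False)   # evict least recently used

    def store_response(self, requested_id, batch) -> None:
        """Cache a lookup response: a list of (object ID, metadata) pairs."""
        for oid, meta in batch:
            if oid == requested_id:
                self.pushed.pop(oid, None)
                self._put(self.demand, self.demand_slots, oid, meta)
            elif oid not in self.demand:
                self.pushed_inserts += 1
                self._put(self.pushed, self.pushed_slots, oid, meta)

    def get(self, oid):
        if oid in self.demand:
            self.demand.move_to_end(oid)
            return self.demand[oid]
        if oid in self.pushed:
            self.pushed_hits += 1
            meta = self.pushed.pop(oid)
            self._put(self.demand, self.demand_slots, oid, meta)
            return meta
        return None   # miss: the client must contact the metadata server

    def adapt(self, min_inserts: int = 16, threshold: float = 0.1) -> None:
        """Shrink the pushed cache when pushed entries are rarely used."""
        if self.pushed_inserts < min_inserts:
            return
        if self.pushed_hits / self.pushed_inserts < threshold:
            self.pushed_slots = max(1, self.pushed_slots // 2)
            while len(self.pushed) > self.pushed_slots:
                self.pushed.popitem(last=False)
```

Keeping the counters separate lets the client observe how useful server-pushed entries are for the current workload and give space back to requested entries when they are not.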

Storage devices: Metadata prefetching and namespace flattening change nothing in the interfaces or
internal functionality of storage devices. The only change storage devices will see is better performance.
In particular, if namespace locality corresponds to access locality, then numerically similar object
IDs  will  exhibit  temporal  locality.  This  locality  will,  in  turn,  translate  into  disk  locality  assuming  traditional 
FFS-like  disk  management  structures  for  objects  and  their  device-internal  metadata  (e.g.,  mappings  of  object 
offsets  to  disk  locations). 

3.1.3  Multi-object  capabilities 

Traditional object storage uses one capability to authorize access to each object. Thus, the proposed metadata
prefetching for N objects would require generation of N capabilities. Since capability generation is
a  non-trivial  computational  expense,  this  would  be  a  significant  overhead  when  some  of  the  prefetched 
information  ends  up  not  being  needed. 

We propose extending the model so that a capability can authorize access to multiple objects. Multi-object
capabilities will reduce prefetching overhead as well as reducing overall metadata server CPU load by
reducing the total number of capabilities generated. It is crucial, however, that the capability’s size remain
small; recall that clients must send a capability as part of every request to an object storage device. Ideally,
the capability would be compact and constant-sized, regardless of the number of objects covered.

This  ideal  can  be  achieved  for  capabilities  that  cover  a  range  of  object  IDs  in  which  the  mapping 
information  is  formulaic  and  the  access  rights  granted  are  consistent.  A  range  of  objects  can  be  specified 
with  just  start  and  end  object  IDs.  The  mapping  information  for  a  collection  of  related  objects  is  often 
similar  and  could  be  specified  as  a  list  of  storage  devices  and  a  simple  scheme  for  determining  location  (e.g., 
index  into  list  with  hash  of  object  ID).  Altogether,  such  a  capability  would  be  approximately  twice  the  size 
of a single object capability (less than 100 bytes).¹
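
A range capability with formulaic mapping can be sketched as follows. The sketch shows only the constant-sized fields and the "index into list with hash of object ID" rule; the MAC is omitted for brevity. The field names, the use of SHA-256 as the hash, and the class name RangeCapability are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeCapability:
    """Fields of a capability covering a contiguous range of object IDs."""
    start_id: int      # first object ID covered
    end_id: int        # last object ID covered (inclusive)
    devices: tuple     # ordered list of storage devices for the range
    rights: str        # access rights, the same for every object covered
    version: int       # freshness field

    def covers(self, object_id: int) -> bool:
        return self.start_id <= object_id <= self.end_id

    def device_for(self, object_id: int) -> str:
        """Pick a device by indexing the list with a hash of the object ID."""
        if not self.covers(object_id):
            raise PermissionError(f"object {object_id:#x} is outside the range")
        digest = hashlib.sha256(object_id.to_bytes(8, "big")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.devices)
        return self.devices[index]


cap = RangeCapability(0x1000, 0x10FF, ("osd0", "osd1", "osd2"), "read", 7)
print(cap.covers(0x1042), cap.device_for(0x1042))
```

The record has the same size whether the range holds two objects or two hundred, which is the property that keeps per-request overhead constant.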

Ranges are a natural choice when namespace flattening is employed, since sequences of object IDs
share locality in the directory hierarchy. A user that has permission to access one file in a directory almost
always has like permission to other files in that and nearby directories. Note that the permissions in question
are for the client machine, so many file permission nuances (e.g., execute permission) do not affect the
ability to generate a capability over multiple objects.

¹ If the metadata server manages object length and times, rather than the storage devices, they will be returned as part of the
metadata. But, they need not be part of the capability.

That said, careful definition of access rights is required when specifying the objects covered with a
range. Specifically, the access rights must rely on the storage device to assist in deciding whether access
is allowed. For reads, the access right can be “allow read access to object if it exists” to allow sparse
ranges. For over-writes, the access right can be “allow write access to existing bytes in existing objects” to
allow writes without unbounded space growth. When capacity usage must be controlled (e.g., by quotas),
individual interactions with the metadata server would be required when creating new data.
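
The two device-assisted access rights quoted above can be sketched as checks that a storage device applies after it has verified a range capability. The class name ObjectStore and its in-memory representation are hypothetical; a real device would also verify the capability's MAC, range, and freshness first.

```python
class ObjectStore:
    """Toy storage device holding objects as byte arrays."""

    def __init__(self):
        self.objects = {}   # object ID -> bytearray

    def may_read(self, object_id: int) -> bool:
        """'Allow read access to object if it exists' (sparse ranges)."""
        return object_id in self.objects

    def may_overwrite(self, object_id: int, offset: int, length: int) -> bool:
        """'Allow write access to existing bytes in existing objects'."""
        data = self.objects.get(object_id)
        if data is None or offset < 0 or length < 0:
            return False
        return offset + length <= len(data)   # must not grow the object


store = ObjectStore()
store.objects[0x1001] = bytearray(4096)
print(store.may_read(0x1001), store.may_read(0x1002))
print(store.may_overwrite(0x1001, 0, 4096), store.may_overwrite(0x1001, 4000, 200))
```

Writes that would extend an object are refused here, so creating new data still goes through the metadata server, as the text requires when capacity must be controlled.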

The one additional change needed for multi-object capabilities is in the freshness information. In
the OSD specification [23], the storage device stores a version number for each object. The metadata server
updates this version number whenever an operation might restrict a previously-issued capability for an object
(e.g., the object is deleted or access rights change). When a client requests a capability, the metadata server
includes this version number in the capability. For the storage device to verify that a capability is valid, the
version number the client provides in the capability must equal the version number for a given object stored
by the storage device. To avoid requiring a separate version number for each object in a group capability,
we change this as follows. We require that the metadata server use a monotonically increasing version
number both for updating the version number at the storage device and for issuing capabilities. We require
the storage device to verify that a version number is greater than, rather than equal to, the version number
stored. Given these two changes, a group capability will now contain a version number that is valid for all
objects updated before the capability was issued. In addition to being constant-sized, this allows a capability
to remain valid for some objects in a group even if it is no longer current for other objects.
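
The modified freshness check can be sketched as follows. The sketch assumes that the metadata server draws a fresh value from a single counter for every object update and for every capability it issues, so that a capability's version is strictly greater than the stored version of any object updated before it was issued. The class names VersionClock and StorageDevice are illustrative.

```python
class VersionClock:
    """Monotonically increasing counter kept by the metadata server."""

    def __init__(self):
        self._value = 0

    def next(self) -> int:
        self._value += 1
        return self._value


class StorageDevice:
    """Stores one version number per object and checks capability freshness."""

    def __init__(self):
        self.versions = {}   # object ID -> stored version number

    def set_version(self, object_id: int, version: int) -> None:
        self.versions[object_id] = version

    def accepts(self, object_id: int, capability_version: int) -> bool:
        stored = self.versions.get(object_id)
        # Greater-than test replaces the original equality test.
        return stored is not None and capability_version > stored


clock = VersionClock()
device = StorageDevice()
device.set_version(1, clock.next())    # object 1 created
device.set_version(2, clock.next())    # object 2 created
group_version = clock.next()           # group capability issued for both
device.set_version(2, clock.next())    # rights on object 2 later restricted
print(device.accepts(1, group_version), device.accepts(2, group_version))
```

After the rights on object 2 change, the same group capability is still accepted for object 1 but rejected for object 2, which is the partial validity described above.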

3.2  Namespace  flattening 

Namespace flattening is an object ID assignment strategy that encodes the directory structure into the object
ID space. For context, two common assignment policies are pseudo-random and create-order. The
pseudo-random policy assigns object IDs with a random number generator or by computing a hash of the
filename or contents; neither preserves any locality information. The create-order policy keeps a counter
and simply assigns the next value to each created object. This will preserve locality information when creation
order matches access order, but tends to suffer from fragmentation over time. This section explains the
fragmentation problem in more detail and describes some namespace flattening algorithms.

3.2.1  Fragmentation  in  the  object  ID  space 

Object ID assignment algorithms are subject to an analogue of the inode fragmentation problem in traditional
block-based filesystems, as analyzed by Smith et al. [27]. Since object IDs are assigned when objects are
created, subsequent operations that modify the namespace (e.g., CREATEs and REMOVEs) will disrupt the
relationship between namespace locality and object ID closeness for simple policies like create-order. Over
time, this disruption tends to slowly randomize the relationship, which reduces locality insofar as the
namespace is a good predictor of access patterns. (Recall that all modern file systems rely on the namespace
in this way.)

A simple algorithm such as create-order suffers especially from fragmentation-related problems because
it creates a perfectly dense object ID space, leaving no room for growth or churn (i.e., creations and deletions
of files) in directories. For example, with create-order, two files created in the same directory on different
days are likely to be assigned very different object IDs. The graphs in Figure 2 illustrate create-order’s
susceptibility to fragmentation. The graphs show the object ID assigned by create-order versus the order of
traversal when using the find command to find a non-existent file in a Linux source tree checked out from a
CVS repository. The leftmost graph shows the initial state of the Linux source tree as soon as it is checked out
from the repository. The cvs checkout command creates files in depth-first order, so the assigned object
IDs exactly match the order of traversal used by find. Thus, for depth-first searches, exactly the right objects
will be prefetched. The middle graph shows the state of the checked out source tree after fifteen patches
(from 2.4.0 to 2.4.15) have been retrieved from the CVS repository and applied. The resulting fragmentation
is clearly visible: items accessed at similar times (i.e., that are close together on the x-axis) no longer have
sequential object IDs. The state after thirty-one patch applications shows even more fragmentation.

Figure 2: Fragmentation over time in the object ID space when using the create-order assignment algorithm.
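
The effect can be reproduced in miniature. The sketch below is a toy model, not the experiment behind
Figure 2: the directory and file names are invented, sorted path order is assumed to stand in for the
traversal order of find, and the "patches" simply add two files to every directory.

```python
from itertools import count


def sequential_fraction(oid_of):
    """Fraction of neighbours in traversal order whose object IDs are consecutive."""
    ids = [oid_of[path] for path in sorted(oid_of)]  # sorted paths stand in for find's order
    pairs = list(zip(ids, ids[1:]))
    return sum(1 for a, b in pairs if b == a + 1) / len(pairs)


def demo():
    counter = count()  # create-order: each created object gets the next counter value
    oid_of = {}
    # Initial checkout: files are created in the same order a traversal visits them.
    for d in range(4):
        for f in range(5):
            oid_of[f"dir{d}/file{f:02d}"] = next(counter)
    fresh = sequential_fraction(oid_of)
    # Later patches add two files to every directory; they receive the highest IDs.
    for d in range(4):
        for f in range(5, 7):
            oid_of[f"dir{d}/file{f:02d}"] = next(counter)
    return fresh, sequential_fraction(oid_of)
```

Right after the checkout every neighbouring pair is sequential; once files are added, each directory's new
files break the run, so a prefetch of adjacent object IDs covers fewer of the files a traversal visits next.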

An additional source of fragmentation can be RENAME operations that move files across directories,
which creates a mismatch between the original allocation choices and the new namespace location. Fortunately,
such operations are extremely rare in practice and should not be significant: analysis of the traces
used in Section 5.2 indicates they comprise fewer than 0.001% of all operations.

Systems can cope with fragmentation proactively or reactively. Most file systems proactively segment
the ID space (e.g., via allocation groups) to create sparseness and separate unrelated growth and churn.
Related objects can be matched to the same segment of the ID space, and allocations/de-allocations within
that segment will not affect others. Some file systems also reactively “defragment” by periodically sweeping
through and changing the assignments to match the ideal. Such defragmentation tends to be difficult in
large-scale, distributed storage systems (as contrasted with desktop file systems), but it does complement
proactive techniques.


Method          /home/        /home/person/   /home/person/submit.ps   /home/person/submit.pdf
Child-closest   0x0400 0000   0x04a0 0000     0x04a0 0007              0x04a0 0008
Cousin-closest  0x0000 0040   0x0000 04a0     0x0000 04a7              0x0000 04a8

Table 1: Namespace flattening example. This table illustrates the two namespace flattening policies by showing object IDs that
might be used for some files in a directory hierarchy. All numbers are in hexadecimal.


3.2.2  Namespace  flattening  algorithms 

This section describes two namespace flattening algorithms for assigning object IDs and proactively avoiding
fragmentation. Both segment the object ID into slots for each directory depth and avoid fragmentation
by keeping each directory’s contents within its slot. The first, child-closest, assigns the depths statically,
explicitly placing subtrees (children) next to one another in the OID space. The second, cousin-closest, uses
shifting to place directories on the same hierarchy level (cousins) close together in the OID space. Table 1
illustrates both, showing how some file names might translate to OIDs. The table and all examples in this
section assume 32-bit OIDs (for presentation) shown in hexadecimal.

Child-Closest: The child-closest algorithm assigns a slot to each level of the directory hierarchy, moving
from left to right, and a number to each file or directory within a directory. For example, if /home/ is
assigned 0x4 and its subdirectory person/ is assigned 0xa, the ID for /home/person/submit.pdf would
start as 0x04a..., assuming that each slot is four bits in size. The file number grows from right to left. If
submit.pdf is assigned 0x8, then the object ID of /home/person/submit.pdf would be 0x04a0 0008.
The object containing the contents of a directory is always the zeroth file in that directory, assigned file
number 0x0 within the directory’s number space.

The child-closest algorithm will perform particularly well for depth-first traversals (e.g., as exhibited
by the find command) and workloads with similar access patterns. Specifically, with this algorithm, the
closeness of the object ID of a directory and a file is a function of their vertical distance in the directory
hierarchy, followed by their horizontal distance; the immediate descendants of a directory are assigned
OIDs that are very close to that of the directory itself and other descendants are assigned OIDs that are
decreasingly close. An additional feature of this policy is that it can represent an entire subtree in a single
range. For example, the range 0xae410000 to 0xae41FFFF might describe the entire subtree rooted at
/home/person/cvs/project/src.
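
A minimal sketch of child-closest packing follows, using the 32-bit OIDs and four-bit directory slots of the
running example. Two layout details are assumptions made only for this illustration: the top four bits are
left as zero (matching the leading 0 in Table 1), and the file number is given a fixed eight-bit field at the
low end. The function names are hypothetical.

```python
OID_BITS = 32
SLOT_BITS = 4       # bits per directory level, as in the running example
RESERVED_BITS = 4   # assumption: leading bits left as zero (cf. Table 1)
FILE_BITS = 8       # assumption: width of the file-number field


def _prefix(dir_numbers):
    """Pack per-level directory numbers from left to right; return (prefix, free_bits)."""
    shift = OID_BITS - RESERVED_BITS
    prefix = 0
    for n in dir_numbers:
        shift -= SLOT_BITS
        if not 0 <= n < (1 << SLOT_BITS):
            raise ValueError("directory number does not fit in a slot (too wide)")
        if shift < FILE_BITS:
            raise ValueError("no slot left for this depth (too deep)")
        prefix |= n << shift
    return prefix, shift


def child_closest(dir_numbers, file_number):
    """Object ID for file `file_number` in the directory named by `dir_numbers`."""
    if not 0 <= file_number < (1 << FILE_BITS):
        raise ValueError("file number does not fit in the file field")
    prefix, _ = _prefix(dir_numbers)
    return prefix | file_number   # file number 0 is the directory's own object


def subtree_range(dir_numbers):
    """Inclusive (low, high) OID range covering everything under a directory."""
    prefix, free_bits = _prefix(dir_numbers)
    return prefix, prefix | ((1 << free_bits) - 1)
```

With /home/ numbered 0x4 and person/ numbered 0xa, child_closest([0x4, 0xa], 0x8) reproduces the Table 1
value 0x04a0 0008, and subtree_range([0x4, 0xa]) gives a single range that contains every object under
/home/person/.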

Cousin-Closest: The cousin-closest algorithm also uses slots and assigns a number to each file and
subdirectory within a directory. However, its object IDs grow from right to left by shifting the parent directory
bits over one slot. For example, if /home/ is assigned 0x4, person/ is assigned 0xa, and submit.pdf
is assigned 0x8, then the object ID of /home/person/submit.pdf would be 0x0000 04a8. As with
child-closest, the object containing the contents of the directory is the zeroth file in that directory.

The  cousin-closest  algorithm  will  perform  particularly  well  for  breadth-first  traversals  and  workloads 
with  similar  access  patterns.  With  this  algorithm,  the  closeness  of  the  object  ID  of  a  directory  and  a  file 
is  a  function  of  their  horizontal  distance  in  the  directory  hierarchy,  followed  by  the  vertical  distance.  An 
additional  feature  of  cousin-closest  is  that  it  tends  to  produce  denser  representations  of  directory  hierarchies. 
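
The shifting used by cousin-closest can be sketched in a few lines. As in Table 1, the sketch assumes 32-bit
OIDs and gives the file number the same four-bit width as a directory slot; the function name and the
overflow handling are hypothetical choices for this illustration.

```python
def cousin_closest(dir_numbers, file_number, slot_bits=4, oid_bits=32):
    """Pack directory numbers, then the file number, shifting left one slot per level."""
    components = list(dir_numbers) + [file_number]
    if len(components) * slot_bits > oid_bits:
        raise ValueError("hierarchy too deep for the OID width")
    oid = 0
    for n in components:
        if not 0 <= n < (1 << slot_bits):
            raise ValueError("number does not fit in a slot (too wide)")
        oid = (oid << slot_bits) | n   # parent bits move over by one slot
    return oid
```

Because shallow paths occupy only the low-order bits, directories at the same depth receive numerically
close IDs: with a hypothetical sibling of person/ numbered 0xb, the two directory objects are 0x04a0 and
0x04b0.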

3.2.3  Excessively  deep  and  wide  hierarchies 

Both namespace flattening policies use statically-sized slots for each directory and file number. This means
that it is possible to have too much width (running out of numbers within a slot) or too much depth (running
out of slots). To address overflow, we partition the object ID space into four distinct regions by using the
two high-order bits. For concreteness, the descriptions below provide examples based on a 64-bit object ID
space in which two bits are used to specify the region, leaving 62 bits within each region.

Primary region: The primary region uses one of the namespace flattening policies for non-overflow
files and directories. In this example, let us assume three bits for each directory slot and eight bits for the file
number within a directory. This primary region accommodates 18 levels of directories (18 x 3 + 8 = 62 bits)
and 255 files within any directory (file number 0 is the directory's own object).

Deep region: If a directory in the directory hierarchy is too deep (greater than 18 levels, in the example),
an unallocated segment of the deep region is used. The deep region contains a prefix (e.g., 27 bits) and
a slotted region (e.g., one file slot and 9 directory slots). Prefixes are allocated in sequential order, and a new
one becomes the root of a new slotted region that grows downward. Locality is lost between the re-rooted
directory and its parent, but locality with its subdirectories is maintained. Despite its name, the deep region
is also used for excess subdirectories in a directory. In the example, only 8 directories can be numbered
within a slot. The ninth and beyond are re-rooted to a deep region prefix, as described above.

Wide  region  (0x2):  If  a  directory  has  too  many  files  (more  than  255,  in  this  example),  an  unallocated 
segment  in  the  wide  region  is  used  for  the  overflow  from  that  directory.  The  wide  region  is  divided  into  a 
prefix  (e.g.,  40  bits)  and  a  large  file  segment  (e.g.,  22  bits).  Using  the  sample  numbers,  four  million  files 
can  be  grouped  under  each  of  the  trillion  prefixes.  As  with  the  deep  region,  ID  locality  is  lost  between  the 
original  directory’s  object  ID  and  the  files  rooted  under  the  wide  region.  But,  for  very  wide  directories,  this 
is  unlikely  to  be  a  significant  penalty. 

Final region (0x3): If one runs out of prefixes in either the wide region or the deep region, the final
region is used. Object IDs in the final region (2^62, in the example) are assigned via create-order.

To  guide  slot  size  choices  and  explore  the  extent  of  overflow  expected  in  a  reasonably  sized  filesystem, 
we  studied  a  departmental  lab  server  that  houses  the  home  directories  and  software  development  activities 
of  about  twenty  graduate  students,  faculty,  and  staff.  On  this  server,  95%  of  all  directories  contain  fewer 
than  eight  directories  and  47  files  (99%  have  fewer  than  127).  Also,  95%  of  files  and  directories  are  fewer 
than  16  levels  of  directories  away  from  the  root  directory.  Thus,  for  this  server,  overflow  would  be  rare  for 
the  example  above. 
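
The bit budgets in the example above can be checked mechanically, and the overflow rules can be written as a
small decision procedure. This is a sketch under the example's numbers only; the function names, the 1-based
indexing, and the boolean flags for prefix exhaustion are hypothetical, and items whose parent has already
been re-rooted are not modeled.

```python
OID_BITS, REGION_BITS = 64, 2
BITS_PER_REGION = OID_BITS - REGION_BITS                     # 62

DIR_SLOT_BITS, FILE_BITS = 3, 8
MAX_DEPTH = (BITS_PER_REGION - FILE_BITS) // DIR_SLOT_BITS   # directory levels in primary
MAX_SUBDIRS = 1 << DIR_SLOT_BITS                             # directories numbered per slot
MAX_FILES = (1 << FILE_BITS) - 1                             # file number 0 is the directory

DEEP_PREFIX_BITS, DEEP_DIR_SLOTS = 27, 9
WIDE_PREFIX_BITS, WIDE_FILE_BITS = 40, 22


def region_for_new_directory(depth, sibling_index, deep_prefix_free=True):
    """depth: levels below the root; sibling_index: 1-based position among subdirectories."""
    if depth <= MAX_DEPTH and sibling_index <= MAX_SUBDIRS:
        return "primary"
    return "deep" if deep_prefix_free else "final"


def region_for_new_file(file_index, wide_prefix_free=True):
    """file_index: 1-based position among the files of its directory."""
    if file_index <= MAX_FILES:
        return "primary"
    return "wide" if wide_prefix_free else "final"
```

The constants confirm that each layout fills exactly 62 bits, and that a 22-bit file segment holds 4,194,304
files under each of the 2^40 (roughly 1.1 trillion) wide-region prefixes.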

4  Experimental  Apparatus 

This section describes the prototype implementation and trace-driven simulator used to evaluate the proposed
techniques.

4.1  Prototype  object-based  storage  system:  Ursa  Minor 

We implemented both namespace flattening algorithms and server-driven metadata prefetching in an object-based
storage system called Ursa Minor. A detailed description of Ursa Minor is available in [1], and it
conforms to the basic architecture illustrated in Figure 1. Its centralized metadata server manages metadata
and distributes capabilities for given object IDs. The storage devices, implemented as application-level
software, store object data, and clients access this data. There is also an NFS server that acts as an object
storage client on behalf of unmodified clients.

Server-driven metadata prefetching involved changes to the metadata server and the client metadata
cache. Since the metadata server stores metadata in a B-tree indexed by object ID, it was easy to have
it prefetch the remaining items in the B-tree page containing the metadata requested by the client. If the
prefetch size (i.e., number of additional objects for which to prefetch metadata, set to 32 in our experiments)
is greater than the number of metadata items contained in this page, the appropriate number of items from
surrounding pages is also prefetched. Our current implementation does not support multi-object capabilities
or metadata compression, so replies to metadata requests are just lists that contain both the requested
and prefetched metadata. The client metadata cache was replaced with a segmented LRU cache [14] containing
both a primary and a prefetch segment. Demand-fetched entries are inserted into the primary cache
and evicted using an LRU policy. Prefetched entries are inserted into the prefetch cache and moved to the
front of the primary cache if used before being evicted.
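
The two-segment cache behavior just described can be sketched as follows. This is an illustrative model, not
Ursa Minor's code: the class and method names are hypothetical, and the "front" of a segment is modeled as
the most-recently-used end of an ordered dictionary.

```python
from collections import OrderedDict


class SegmentedMetadataCache:
    """Primary segment for demand-fetched entries, prefetch segment for the rest."""

    def __init__(self, primary_size, prefetch_size):
        self.primary = OrderedDict()    # most recently used entry is last
        self.prefetch = OrderedDict()
        self.primary_size = primary_size
        self.prefetch_size = prefetch_size

    @staticmethod
    def _insert(segment, capacity, oid, meta):
        segment[oid] = meta
        segment.move_to_end(oid)
        while len(segment) > capacity:
            segment.popitem(last=False)  # evict the least recently used entry

    def lookup(self, oid):
        if oid in self.primary:
            self.primary.move_to_end(oid)
            return self.primary[oid]
        if oid in self.prefetch:
            # A prefetched entry was used before eviction: promote it to the primary segment.
            meta = self.prefetch.pop(oid)
            self._insert(self.primary, self.primary_size, oid, meta)
            return meta
        return None                      # miss: the caller must ask the metadata server

    def insert_demand(self, oid, meta):
        self.prefetch.pop(oid, None)
        self._insert(self.primary, self.primary_size, oid, meta)

    def insert_prefetched(self, oid, meta):
        if oid not in self.primary:
            self._insert(self.prefetch, self.prefetch_size, oid, meta)
```

Keeping prefetched entries in their own segment means that a burst of unused prefetches can only evict other
prefetched entries, never metadata the client has actually demanded.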

We  implemented  the  namespace  flattening  algorithms  in  the  NFS  server.  The  NFS  server  translates 
NFS  requests  for  filehandle  data  into  requests  for  object  data  that  can  be  understood  by  the  underlying  object 
storage  system.  A  one-to-one  mapping  between  objects  and  filehandles  is  maintained  by  the  NFS  server, 
and  object  IDs  are  assigned  at  time  of  creation.  Specifically,  upon  receiving  a  CREATE  or  MKDIR  request, 
the  NFS  server  allocates  an  object  ID,  creates  the  object,  and  constructs  an  NFS  filehandle  that  includes  the 
new  object  ID.  Object  IDs  in  the  system  are  128  bits  in  size,  with  the  high  32  bits  dedicated  to  specifying  a 
partition  number.  Thus,  the  namespace  flattening  algorithms  have  96  bits  to  work  with.  However,  given  the 
64-bit  object  ID  in  the  OSD  specification,  we  limit  the  system  to  using  64  bits.  We  chose  10  bits  for  the  file 
number and 5 bits for each of the 10 directory slots.¹

Ursa Minor’s metadata accesses can be classified into two categories: mandatory and non-mandatory.
Mandatory accesses are accesses due to operations that must be propagated to the metadata server immediately
so as to guarantee a consistent namespace view for all clients. Specifically, they are accesses due
to operations that modify the namespace (e.g., CREATE, RENAME, and REMOVE). Since, in this system,
the metadata server manages length information, APPEND and TRUNCATE operations also incur mandatory
accesses. All other operations (e.g., READ and WRITE) incur non-mandatory accesses. Namespace flattening
and server-driven metadata prefetching can eliminate non-mandatory accesses, but cannot prevent
mandatory accesses.
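
The classification above determines an upper bound on what prefetching can save for a given operation mix.
The sketch below encodes it; the operation sets are assumptions based on the examples in the text (MKDIR is
added here as a namespace-modifying operation), and the server_manages_length flag is a hypothetical knob for
systems in which length updates need not reach the metadata server.

```python
NAMESPACE_OPS = frozenset({"CREATE", "MKDIR", "RENAME", "REMOVE"})  # assumed set
LENGTH_OPS = frozenset({"APPEND", "TRUNCATE"})


def is_mandatory(op, server_manages_length=True):
    """True if the operation must reach the metadata server immediately."""
    op = op.upper()
    return op in NAMESPACE_OPS or (server_manages_length and op in LENGTH_OPS)


def max_eliminable_fraction(ops, server_manages_length=True):
    """Upper bound on the fraction of metadata accesses prefetching could remove."""
    ops = list(ops)
    non_mandatory = sum(1 for op in ops if not is_mandatory(op, server_manages_length))
    return non_mandatory / len(ops)
```

For a mix of one CREATE, one READ, one WRITE, and one APPEND, at most half of the accesses are eliminable
when the server manages length, and at most three quarters when it does not.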

4.2  Trace-driven  object  storage  simulator 

We extended an object-based storage simulator to allow evaluation with large traces of real workloads and
fewer implementation artifacts. The simulator takes as input an NFS trace and outputs the number of metadata
cache accesses incurred by each client, which is an indication of the load placed on the metadata server
and required client latency.

The  simulator  proceeds  in  two  phases:  reconstruction  and  simulation.  The  reconstruction  phase  scans 
a  trace  to  recreate  the  state  of  the  file  system,  as  much  as  possible,  at  the  time  before  the  simulated  portion 
of  the  trace.  It  uses  the  information  yielded  by  traced  operations  (e.g.,  CREATE,  LOOKUP,  and  READDIR)  to 
reconstruct  the  namespace.  At  the  end  of  this  phase,  object  IDs  are  assigned  to  all  items  in  the  reconstructed 
namespace  according  to  whichever  policy  is  used. 

The  reconstructed  namespace  will  be  imperfect.  Most  importantly,  only  files  and  directories  that  were 
accessed  are  visible  in  the  trace,  and  so  unused  parts  of  the  original  file  system  will  be  absent.  This  will  tend 
to  make  the  file  system  look  smaller  and  more  densely  accessed.  In  addition,  the  creation  order  of  files  that 
exist before the trace began cannot be known. It can be approximated using the modification time available for
most accessed files, given that most files are created and written in their entirety in the traced environment.
We  believe  that,  despite  such  limitations,  the  simulation  results  represent  reasonable  expectations  for  the 
traced  environments. 

The simulation phase models client-metadata interactions for the reconstructed file system. It simulates
a metadata cache (using segmented LRU) for each client. For each file accessed in the trace, a check is made
to see if metadata for that file exists in the appropriate client’s metadata cache. If so, the number of cache hits
is incremented. If not, the number of metadata accesses is incremented and metadata for the surrounding
N object IDs is prefetched into the client’s metadata cache. Accesses to files not in the reconstructed
namespace are ignored, as where they fit is unclear. Such files account for less than 2% of all files accessed
in most of the traces analyzed; in one trace (EECS03), however, they account for 25%. Files created during
simulation are added to the reconstructed namespace and assigned an object ID.

¹ Our implementation of namespace flattening algorithms uses 4 bits for the region number, even though only 2 bits are actually
needed. This is an artifact of our current implementation.

As in the object-based storage system, metadata accesses in the simulation phase are categorized as either
mandatory or non-mandatory. Unlike in the prototype system, however, clients are granted the authority
to manage the length of a file themselves for a certain amount of time without propagating updates to the
metadata server. As a result, mandatory accesses only include those accesses that result from operations
that modify the namespace (e.g., CREATE and RENAME). All other operations, including length updates,
are modeled as non-mandatory accesses. This model of file length management more accurately reflects
existing object storage systems [12, 24].
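
The core loop of the simulation phase can be sketched as follows. This is a simplified illustration rather
than the simulator itself: it substitutes a single LRU cache per client for the segmented LRU cache, counts
only non-mandatory accesses, and interprets "the surrounding N object IDs" as a window of assigned IDs
centred on the requested one. The function name and input format are hypothetical.

```python
from bisect import bisect_left
from collections import OrderedDict, defaultdict


def simulate(trace, oid_of, prefetch_n=32, cache_size=3000):
    """trace: iterable of (client, path); oid_of: dict mapping path -> object ID."""
    sorted_oids = sorted(oid_of.values())
    caches = defaultdict(OrderedDict)   # one plain LRU cache per client (simplification)
    stats = defaultdict(lambda: {"hits": 0, "accesses": 0})
    for client, path in trace:
        if path not in oid_of:
            continue                    # files outside the reconstructed namespace are ignored
        oid, cache = oid_of[path], caches[client]
        if oid in cache:
            cache.move_to_end(oid)
            stats[client]["hits"] += 1
            continue
        stats[client]["accesses"] += 1  # miss: one metadata server interaction
        i = bisect_left(sorted_oids, oid)
        lo = max(0, i - prefetch_n // 2)
        # Cache the requested ID together with up to prefetch_n neighbouring IDs.
        for neighbour in sorted_oids[lo:lo + prefetch_n + 1]:
            cache[neighbour] = True
            cache.move_to_end(neighbour)
        cache.move_to_end(oid)
        while len(cache) > cache_size:
            cache.popitem(last=False)   # evict least recently used entries
    return dict(stats)
```

Running the same trace with object IDs produced by different assignment policies, and comparing the
resulting access counts, mirrors the comparison made in the evaluation.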

5  Evaluation 

We evaluated server-driven metadata prefetching and namespace flattening algorithms in two ways. To
quantify the benefits of both techniques on an actual system, we ran several benchmarks on the modified
version of Ursa Minor (Section 5.1). To determine the effects of fragmentation on the object ID assignment
algorithms, we also evaluated both techniques via trace replay of large, real NFS traces using our object-based
storage simulator (Section 5.2). For both evaluations, we report the reduction in accesses to the
metadata server as compared to the case in which no prefetching is performed; this is a measure of both
the decrease in end-to-end latency seen by clients and work saved at the metadata server.

5.1  Evaluation  using  Ursa  Minor 

We ran four benchmarks on Ursa Minor to determine the benefits of prefetching using the child-closest and
cousin-closest namespace flattening algorithms compared to when no prefetching is performed. For comparison
purposes, we also implemented the create-order assignment algorithm, which assigns a monotonically
increasing object ID to objects when created. The benchmarks were run using a total of four machines. Two
were run as storage devices: one stored data and the other metadata. The NFS server and benchmark were
co-located on the same machine and communicated via the loopback network interface. This setup emulates
direct client access, albeit with additional software overhead. Finally, a single machine was dedicated
to the metadata server. Each machine contained a 3.0 GHz Pentium 4 processor, an Intel Pro 1000 network
card, and four 230 gigabyte Western Digital WD2500 SATA disk drives that ran on a 3ware 9000 series
RAID controller in JBOD mode. Thirty-two items were prefetched on every access to the metadata server
and the segmented LRU client metadata cache was configured to use a 2,000 entry demand cache and a
1,000 entry prefetch cache.

The four benchmarks used for evaluation are listed below. We used these because more popular benchmarks
do not exhibit inter-file locality. For example, Postmark uses a random number generator to select
which file to access next [15], and IOzone uses only a single file for its benchmarking [22]. For these
synthetic workloads, no noteworthy benefits are seen from namespace flattening and metadata prefetching.
Each of our custom benchmarks represents a specific, but common, use of a filesystem.

Tar: This benchmark consists of untarring the Linux 2.4 source tree. The potential benefit from
prefetching is very limited because this benchmark’s metadata interactions are almost all creates, which
are mandatory accesses that cannot be eliminated.

Build: This benchmark consists of building the Linux 2.4 source. This involves reading the source
files and creating object files. Due to the reads involved, the maximum potential benefit from prefetching is
greater than in Tar. Still, the maximum benefit is limited by the large number of mandatory create operations
for the object files.

Patch/rebuild: This benchmark consists of patching the Linux 2.4.31 source to Linux 2.4.32 and
rebuilding. The maximum potential benefit from prefetching is greater than in Build because the re-build
phase creates fewer files.

Search: This benchmark consists of a depth-first search for a non-existent word in the Linux 2.4 source
tree using the command find . -type f | xargs grep <NONEXISTENT>. Since its workload is read-only,
this benchmark involves no mandatory accesses to the metadata server. The reduction in metadata
accesses is limited only by the number of items prefetched and the efficacy of the object ID assignment
algorithm used. As such, compared to the other benchmarks, Search exhibits the greatest potential for
reduction in metadata server accesses.

Figure 3: Benefit obtained by using the various object ID assignment algorithms and server-driven metadata prefetching
in Ursa Minor (lower is better). These graphs show the decrease in metadata accesses when prefetching is performed while using
the various object ID assignment algorithms as compared to the baseline (in which no prefetching is performed) for the benchmarks
considered.

Figure 3 shows the results of the benchmarks. The results show that, in Search where the potential
for benefit is greatest, server-driven metadata prefetching and use of any of the three possible object ID
assignment algorithms eliminates 94% of all metadata accesses. In Patch/rebuild, 89% of non-mandatory
accesses and 58% of all metadata accesses are eliminated by the child-closest and cousin-closest algorithms;
slightly fewer accesses are eliminated by the create-order algorithm. Because 71% of all accesses
are mandatory in Build, the potential for benefit is limited. However, prefetching using the namespace flattening
algorithms eliminates 27% of all accesses in this benchmark and 75% of all non-mandatory accesses.
Create-order eliminates 9% fewer non-mandatory accesses than the namespace flattening algorithms, but
this yields only a 2% difference in total metadata accesses prevented. Since all accesses in Tar are mandatory,
prefetching yields no benefit in this benchmark.

The create-order, child-closest, and cousin-closest algorithms all perform similarly in these benchmarks.
The create-order algorithm performs well because none of the benchmarks exhibit enough concurrency
or create enough additional files to fragment the object ID space during their brief runtimes. Even
so, some difference is visible: the create-order algorithm performs slightly worse than the child-closest or
cousin-closest algorithms in Build and Patch/rebuild.

5.2  Evaluation  using  the  trace-driven  object  storage  simulator 

In addition to evaluating server-driven metadata prefetching and namespace flattening algorithms in the
modified object-storage system, we also performed a trace-based evaluation of these techniques using our
object-based storage simulator. For this evaluation, the simulator was configured to use the same prefetch
size and cache configuration as that used for the experiments run on Ursa Minor (32 items were prefetched
on every cache miss, and the segmented LRU client metadata cache was configured to use a 2,000 entry
demand cache and a 1,000 entry prefetch cache). Since the creation order of files cannot be known for
files that were created before the start of the trace period, the create-order algorithm was approximated
using the earliest modification time seen for each file in the trace. Finally, for comparison purposes, we
also implemented a pseudo-random object ID assignment algorithm. We expected this algorithm to perform
worse than any of the other assignment algorithms and, when significant cache pressure exists, worse than
when no prefetching is performed.

Section 5.2.1 describes the traces used for this evaluation. Section 5.2.2 discusses the aggregate results
obtained.  Section  5.2.3  discusses  the  effects  of  individual  client  workloads  on  the  aggregate  benefit  achieved. 

5.2.1  Traces  used 

Three NFS traces from Harvard University were used in this trace-based evaluation. We describe each trace
and its constituent workload below.

EECS03: The EECS03 trace captures NFS traffic observed at a Network Appliance filer between February
8th and 9th, 2003. This filer serves home directories for the Electrical Engineering and Computer Science
Department. It sees an engineering workload of research, software development, course work, and
WWW traffic. Detailed characterization of this environment can be found in [7].

DEAS03: The DEAS03 trace captures NFS traffic observed at another Network Appliance filer in
February 2003. This filer serves the home directories of the Department of Engineering and
Applied Sciences. It sees a heterogeneous workload of research and development combined with
e-mail and a small amount of WWW traffic. The workload seen in the DEAS03 environment can be
best described as a combination of that seen in the EECS03 environment and e-mail traffic. Detailed
characterization of this environment can be found in [6] and [7].

CAMPUS: The CAMPUS trace captures a subset of the NFS traffic observed by the CAMPUS storage
system between October 15th and 28th, 2001. The CAMPUS storage system provides storage for the
e-mail, web, and computing activities of 10,000 students, staff, and faculty and is comprised of fourteen
53 GB storage disk arrays. The subset of activity captured in the CAMPUS trace includes only the
traffic between one of the disk arrays (home02) and the general e-mail and login servers. NFS traffic
generated by serving web pages or by students working on CS assignments is not included. However,
despite these exclusions, the CAMPUS trace contains more operations per day (on average) than
either the EECS03 or DEAS03 trace. Detailed characterization of this environment can be found in [5]
and [6].

Due to differences in the number of operations seen in each trace, the size restrictions of the database
used by the simulator to store the reconstructed namespace, and raw time required for processing, we were
unable to use the same time periods for each trace. For the EECS03 trace, we reconstructed the server
namespace using the February 2003 trace and performed simulation over the February 9th, 2003 trace.
Reconstruction and simulation for the DEAS03 trace were performed over the same dates as the EECS03
trace. For CAMPUS, we reconstructed using the October 15th to October 21st, 2001 trace and simulated
using the October 22nd to 28th trace.

5.2.2  Overall  results 

The graphs in Figure 4 show the aggregate metadata accesses incurred by all clients in each trace when
prefetching using the various algorithms. The left-most graph shows the number of total metadata accesses
required as a percent of that required when prefetching is disabled. In this graph, the gray bar shows
the contribution by mandatory accesses whereas the white bar shows the contribution by non-mandatory
accesses. Since mandatory accesses cannot be eliminated via prefetching, the right-most graph shows the
results obtained when mandatory accesses are excluded. The aggregate results shown in the graphs yield
two major results. First, use of server-driven metadata prefetching and namespace flattening results in a
large reduction in metadata accesses in the EECS03 trace, but the benefit these techniques offer is limited
in both the DEAS03 trace and the CAMPUS trace. Second, the create-order algorithm does perform worse
than the child-closest and cousin-closest algorithms in all of the traces, but the difference is most evident
in cases where the benefit derivable from prefetching is limited; in such cases the choice of what items to
prefetch becomes paramount and the inter-file relationships exposed by the child-closest and cousin-closest
algorithms serve to realize the small potential for benefit.

[Figure 4 plots not reproduced: one panel per trace (EECS03, DEAS03, CAMPUS), with bars for No pref., Random,
C.-order, Cousin, and Child; the legend distinguishes Mandatory and Non-mandatory accesses.]

Figure 4: Percentage of metadata accesses required by each object ID assignment algorithm as compared to the case where
no prefetching is performed (lower is better). In the leftmost graph, metadata accesses are categorized as either mandatory or
non-mandatory. Since mandatory accesses cannot be eliminated, the rightmost graph shows non-mandatory accesses only.

Of all the traces used for our study, the EECS03 trace is probably the most representative of an
academic/research workload, as DEAS03 and CAMPUS are both dominated by e-mail traffic. As such, it
is  worthy  of  note  that  the  EECS03  trace  stands  to  gain  the  most  benefit  from  both  techniques.  Almost  50% 
of  all  metadata  accesses  in  this  trace  are  non-mandatory  and  can  be  eliminated.  This  potential  is  realized 
most  by  the  child-closest  and  cousin-closest  algorithms,  which  both  eliminate  68%  of  them.  Problems  due 
to  fragmentation  do  not  affect  create-order  much  in  this  trace,  as  it  performs  only  slightly  worse  than  the 
two  namespace  flattening  algorithms. 

The potential benefit from prefetching is more limited in the DEAS03 and CAMPUS traces than in
the EECS03 trace. However, the effects of fragmentation on the create-order algorithm are much more
visible on these two traces. In DEAS03, 80% of all metadata accesses are mandatory; this predominance
of mandatory accesses results from the large number of temporary files (each of which incurs mandatory
CREATE and REMOVE operations) seen in this trace. Of the 20% of accesses that are non-mandatory, the
child-closest and cousin-closest algorithms eliminate 33%, while create-order eliminates 25%.

As in DEAS03, benefit is limited in the CAMPUS trace due to temporary file accesses. During peak
hours, 50% of all files referenced in the CAMPUS trace are temporary lock files used to coordinate access to
the inboxes. Additionally, many CAMPUS users use e-mail applications that create many temporary files
for e-mail compositions [5]. Non-mandatory accesses comprise 17% of all metadata accesses in CAMPUS;
31% of these accesses are eliminated by the child-closest and cousin-closest algorithms, while only 15% is


Figure 5: Factor reduction in number of metadata accesses by each client. Each log-log scale scatter-plot shows the number
of metadata accesses with prefetching disabled on the x-axis and the number of metadata accesses with prefetching enabled on the
y-axis.

eliminated  by  the  create-order  algorithm. 
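
The per-trace percentages above can be combined into an overall figure: the fraction of all metadata accesses eliminated is the non-mandatory fraction multiplied by the fraction of non-mandatory accesses that prefetching removes. The following is a minimal sketch of that arithmetic for the child-closest and cousin-closest algorithms. The function and variable names are illustrative, and the "almost 50%" quoted for EECS03 is treated as exactly 0.50, which is an assumption.

```python
# Fractions quoted in the text for the child-closest / cousin-closest algorithms.
# Tuple layout: (non-mandatory fraction of all accesses,
#                fraction of non-mandatory accesses eliminated).
# Assumption: "almost 50%" for EECS03 is rounded to 0.50 here.
TRACES = {
    "EECS03": (0.50, 0.68),
    "DEAS03": (0.20, 0.33),
    "CAMPUS": (0.17, 0.31),
}


def aggregate_reduction(non_mandatory_fraction, eliminated_fraction):
    """Fraction of ALL metadata accesses eliminated by prefetching.

    Mandatory accesses are untouched, so only the non-mandatory share
    contributes to the overall reduction.
    """
    return non_mandatory_fraction * eliminated_fraction


if __name__ == "__main__":
    for name, (non_mandatory, eliminated) in TRACES.items():
        overall = aggregate_reduction(non_mandatory, eliminated)
        print(f"{name}: {overall:.1%} of all metadata accesses eliminated")
```

Under these assumptions, the overall reduction is roughly a third of all accesses for EECS03 but only a few percent for DEAS03 and CAMPUS, which is consistent with the limited aggregate benefit reported for the two e-mail dominated traces.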

This trace-based evaluation shows that, though use of server-driven metadata prefetching and namespace
flattening algorithms does yield noticeable benefits on academic/research workloads (e.g., they yield a
68% reduction in non-mandatory metadata accesses in the EECS03 trace), the exact choice of namespace
flattening algorithm is not critical. Two possible conclusions can be inferred from the identical performance
of the child-closest and cousin-closest algorithms. First, prefetching metadata of files in the same directory
level  might  be  as  useful  as  prefetching  the  metadata  of  descendants.  Alternatively,  the  access  patterns  seen 
in  the  traces  might  be  such  that  it  is  only  useful  to  prefetch  metadata  for  items  in  the  same  directory  as  the 
demand-fetched  files. 

5.2.3  Individual  client  performance 

The graphs in Figure 5 show the reduction in non-mandatory metadata accesses seen by each individual
client in the traces. Since no differences were observable between the child-closest and cousin-closest
algorithms, only the accesses incurred by the child-closest algorithm are shown. The graphs show that the individual
workloads generated by various clients significantly impact the benefit obtainable from server-driven metadata
prefetching and namespace flattening. For example, though the aggregate reduction in metadata accesses is
limited in the DEAS03 trace, many clients see between a factor of 2 and a factor of 4 reduction in accesses.
The aggregate benefit is limited because of one client that generates 43% of all metadata accesses and sees
no benefit from prefetching. Conversely, in the EECS03 trace, most clients do not see much benefit from
prefetching.  However,  a  single  client,  which  accounts  for  86%  of  all  metadata  accesses,  sees  a  factor  of  4 
reduction.  Only  seven  clients  are  present  in  the  CAMPUS  trace;  three  of  these  clients  account  for  less  than 
1%  of  all  accesses  combined,  whereas  the  other  four  account  for  25%  each.  The  four  clients  that  account 
for  the  majority  of  accesses  do  not  see  much  benefit  from  prefetching. 
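
The effect of a single dominant client on the aggregate can be illustrated with a small calculation. The sketch below is not taken from the simulator; it assumes each client is described by its share of all accesses (with prefetching disabled) and the factor by which prefetching reduces that client's accesses. The function name and the two-group split are hypothetical simplifications: the DEAS03 example treats every client other than the dominant one as seeing the same factor, whereas the text only says that many clients see a factor of 2 to 4.

```python
def aggregate_factor(clients):
    """Aggregate reduction factor across clients.

    clients: list of (share, factor) pairs, where share is the client's
    fraction of all accesses with prefetching disabled (shares sum to 1)
    and factor is that client's reduction factor (1.0 means no benefit).
    """
    total_share = sum(share for share, _ in clients)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("client shares must sum to 1")
    # Accesses remaining after prefetching, as a fraction of the original.
    remaining = sum(share / factor for share, factor in clients)
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical DEAS03-like split: one client with 43% of accesses sees
    # no benefit; the remaining 57% is assumed to see a uniform factor.
    for other_factor in (2.0, 4.0):
        factor = aggregate_factor([(0.43, 1.0), (0.57, other_factor)])
        print(f"others at {other_factor:.0f}x -> aggregate {factor:.2f}x")
```

Even with every other client at a factor of 4, the aggregate factor in this sketch stays below 2, because the accesses of the client that sees no benefit remain in full.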

In summary, even though the aggregate reduction in metadata accesses may be limited for some
workloads, hence limiting metadata server scalability, individual client latencies may still see large benefits.

6  Conclusion 

Server-driven  metadata  prefetching  and  namespace  flattening  mitigate  the  small  file  efficiency  problems  of 
object-based storage systems. Rather than having clients interact with the metadata server for each file, the
server provides metadata for multiple files each time. This reduces both metadata server load and client
access latency. Namespace flattening translates namespace locality into object ID similarity, providing
hints for prefetching and other policies, enhancing metadata index locality, and allowing compact range
representations. Combined, these techniques should help object-based storage to satisfy a larger range of
workload types, rather than being just for high-bandwidth large file workloads.